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Independent component analysis (ICA) has been widely used for 
blind source separation in many fields such as brain imaging analysis, 
signal processing and telecommunication. Many statistical techniques 
based on M-estimates have been proposed for estimating the mixing 
matrix. Recently, several nonparametric methods have been devel- 
oped, but in-depth analysis of asymptotic efficiency has not been 
available. We analyze ICA using semiparametric theories and pro- 
pose a straightforward estimate based on the efficient score function 
by using B-spline approximations. The estimate is asymptotically ef- 
ficient under moderate conditions and exhibits better performance 
than standard ICA methods in a variety of simulations. 

1. Introduction. Independent component analysis (ICA) aims to sepa- 
rate independent blind sources from their observed linear mixtures with- 
out any prior knowledge. This technique has been widely used in the past 
decade to extract useful features from observed data in many fields such as 
brain imaging analysis, signal processing and telecommunication. Hyvari- 
nen, Karhunen and Oja [16] described a variety of applications of ICA. For 
example, Vigario, Jousmaki, Hamalainen, Hari and Oja [25] used ICA to 
separate artifacts from magnetoencephalography (MEG) data, without the 
burden of modeling the process that generated the artifacts. 

Standard ICA represents an m x 1 random vector X (e.g., an instanta- 
neous MEG image) as linear mixtures of m mutually independent random 
variables {Si, . . . , Sm) (e.g., artifacts and other brain activities), where the 
distribution of each Si is totally unknown. That is, for S = {Si, . . . , Sm)"^ 
and some m x m nonsingular matrix W, 
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Here, W is called the mixing matrix. Given n i.i.d. observations, X , . . . , X", 
from the distribution of X, the aim is to estimate W and, thus, to separate 
each component of S = WX such that the components are maximally mu- 
tually independent. W is called the unmixing matrix. This can be seen as a 
projection pursuit problem [14] in which m directions are sought such that 
the corresponding projections are most mutually independent. 

It was shown by Comon [9] that W is identifiable up to scaling and per- 
mutation of its rows if at most one Si is Gaussian [18]. Model (1.1) can be 
viewed as a semiparametric model with parameters (W, ri, . . . , r-m), where 
is the probability density function (PDF) of Si. Our interest centers on W; 
(ri, . . . , r^) are nuisance parameters. 

ICA was motivated by neurophysiological problems in the early 1980's 
(see [16]) and two classes of methods have been proposed to estimate W. 
One class involves specifying a particular parametric model for each r, and 
then optimizing contrast functions that involve {W,ri, . . . ,rm). Primary ex- 
amples of this approach are maximum likelihood (ML) (e.g., [19, 21]) or, 
equivalently, minimum mutual information (e.g., [9]), minimizing high-order 
correlation between components of WX (e.g., [7]) and maximizing the non- 
Gaussianity of WX (e.g., [15]). A second class of methods view ICA as a 
semiparametric model and assume nothing about the distributions of the 
components Si. Thus, two distinct goals can be formulated: (i) to find esti- 
mates of that are consistent or, even better, \/n-consistent — that is, 
W = W + Op(n~^/^) and (ii) to find procedures that achieve the information 
bound — that is, estimates of W which are asymptotically normal and have 
smallest variance-covariance matrix among all estimates that are uniformly 
asymptotically normal in a suitable sense; see [5]. Amari [1] formally demon- 
strated that to achieve the information bound in this situation, a method 
must estimate the densities of the sources. In fact, it can even be shown 
[6] that for any fixed estimating equation corresponding to maximizing an 
objective function, there is a possible distribution of sources for which the 
global maximizer is inconsistent, despite the consistency of a local solution 
near the truth. 

Recently, some nonparametric methods to estimate W have appeared. 
For example. Bach and Jordan [3] proposed: (i) To reduce the dimension 
of the data by using a kernel representation and (ii) to choose W so as to 
minimize the empirical generalized variance among the components of WX. 
Hastie and Tibshirani [13] proposed maximizing the penalized likelihood as 
a function of (W,ri, . . . ,rm) and Vlassis and Motomura [26] proposed max- 
imizing the likelihood by using Gaussian kernel density estimation. Various 
performance analyses have been carried out using simulations. The Vlassis- 
Motomura and Hastie-Tibshirani methods are of the same type as ours, 
but these papers do not provide a method for tuning the procedures and 
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nothing has been proven about their asymptotic properties. Samarov and 
Tsybakov [22] proposed and analyzed a \/n-consistent estimate of W un- 
der mild conditions. Chen and Bickel [8] analyzed the method of Eriksson 
and Koivunen [12] based on characteristic functions and showed it to be 
consistent under the minimal identifiability conditions and -^/n-consistent 
under additional mild conditions. This paper concerns the construction of 
efficient estimates. We develop an efficient estimator by using efficient score 
functions after starting the algorithm at a consistent point based on the 
PCFICA algorithm of Chen and Bickel [8]. 

The outline of the paper is as follows. In Section 2 we analyze ICA as a 
semiparametric model and propose a new method to estimate W using the 
efficient score function. The main theorem is given in Section 3. Numerical 
studies are given in Section 4. Technical details are provided in Sections 5 
and 6. 

Notation. In this paper, W denotes an m x m matrix and Wi and Wij 
denote the zth row and the {i,j)th element of W, respectively. denotes 
the transpose of a matrix A and A~'^ denotes the transpose of A~^. For 

any matrix A with column vectors {a,, -1 <i < k}, \\A\\f = \J tr{A^ A) and 

vec{A) = (a^, a^, . . . , a^^)^, a column vector created from A. Define the sup- 
norm as |/|oo = sup4gig|/(t)| . X* denotes the ith random sample from the 
distribution of X. The population (empirical) law of X is denoted by P 
{Pn)- Xi and Si denote the element of X and 5, respectively. Denote 
the vector of density functions (ri,...,rm) by ri:^. A vector or matrix of 
functions is denoted in boldface. For a vector of functions B, BB"^(x) shall 
be used as an abbreviation of B(x)[B(x)]-^. 

2. Semiparametric inference. In this section, we first briefly review the 
salient features of estimation in semiparametric models and then show how 
to solve an approximate efficient score equation for estimating W in the ICA 
model. 

2.1. Efficient estimates for semiparametric models. Given a semipara- 
metric model, X^, . . . , X^ i.i.d. {P(e,ri) '-0 ^Q.C M'^, ry G <£^}, where is a sub- 
set of a function space, estimates 9^ of 6 are called regular if ^/n{9n — 0) 
converges in law uniformly in P[en,rin)^ where {On,r]n) converges to (Oq^tjo) 
in a smooth way. Then if there is a regular estimate that is uniformly best 
(call it 0*), it must have the form 

1 " 

(2.1) e: = 9 + -Y.l{X\e, v) + Op(n-V2) 

1=1 

under P[e,ri)- The function 1 is called the efficient influence function in [5]. 
When rj = {rji, . . . , r/^/) is a Euclidean parameter, 1 is, under regularity con- 
ditions, the influence function of the ML estimator (MLE) of 9. That is, if 
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i = (^T) ^j") • ■ • ) '^fj)'^ ■> where I is the log-hkehhood function of a single ob- 
servation and l{9,r]) = ElV{X,6,rj) is the Fisher information matrix, then 1 
is the first d coordinates of the vector 1~^{6, 7])l. An alternative formulation 
is to begin by defining the efficient score function 1* = (IJ, . . . , l*^)'^ with 

where ajk{e,il) minimizes E{^{X,e,7]) - j:f=iajk{e,7])^{X,e,r]))\ 
That is, 1* is the projection of ^{X,9,r]) onto the orthocomplement of 
span{^{X,e,r]):l<j<d'}. Then 

i = {E[n*^{x,e,7])])-h*. 

When r] is infinite-dimensional, the generalization of span{^-(X, 9,r]) :! < 
j < d'} is the tangent space. That is defined to be the closed linear span of 
{■^{X, 9, ?7(A))|a=o '■ ^(0) = rj and A — > 77(A) defines a smooth one-dimensional 
submodel {-P(0,j;(a)) • l-^l < 1}} Now, 1* is again obtained by pro- 

jection onto the orthocomplement of this span. An extensive discussion of 
tangent spaces and the geometric interpretation of formulas such as the one 
above is given in [5], Chapters 2 and 3. For many canonical semiparametric 
models including ICA, 1* can be computed; we sketch the argument in the 
Appendix. Suppose that for each 9, an estimate fi(9) is available and is at 
least consistent. Then the usual Taylor expansions suggest that the solution 
of the generalized estimating equation 

n 

(2.2) Y.^*iX\9,m) = 

i=l 

will have an influence function 1 and, hence, be efficient. These heuristics 
and others are discussed in Chapter 7 of [5]. Of course, more than consis- 
tency is needed and after calculating 1* in our case, validating that (2.2) 
leads to (2.1) for a suitable rj{9) is the subject of Sections 3, 5, 6 and the 
Appendix. Note that if 7){9) maximizes J27=i K-^^ ^ ^ ^ v) j then (2.2) simply 
gives the profile maximum likelihood estimate discussed in [20]. In that case, 
(2.2) simplifies, becoming equivalent to 

81 

i=l 

Unfortunately, such fj{9) do not exist in the ICA model. Using 1* instead of 
^ in the estimating equation (2.2) permits a less demanding choice of 'i){9). 
These issues are discussed in detail in [5], Chapter 7. In this paper, we sim- 
ply show that a 9 solving (2.2) for a particular fj[9) does indeed satisfy (2.1) 
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in a sufficiently uniform sense. Optimality of 9 tlien follows from the general 
theory given in Chapter 3 of [5]. 

This technique is different from the quasi-ML method which belongs to 
the first class of methods in the ICA literature reviewed in Section 1. This 
approach is to guess some shape rjo for r] and then use ordinary ML. Of 
course, if r]Q is true, then the resulting estimate is asymptotically Gaussian 
and has smaller variance than the 9 we discuss. But, if ijq is false, then the 
estimate can be inconsistent. The ICA algorithms used for comparison in 
Section 4 such as FastICA [Hyvarinen and Oja (1997)] and extended infomax 
[19] are of this type. Closest to ours in spirit among these is the method of 
Pham and Garrat [21]. They use parametric models such as logsplines (see 
Section 2.3) for the nuisance parameters. However, they propose solving the 
score equations rather than (2.2). More importantly, they do not suggest 
increasing the model dimension with n, do not give a method for selecting the 
number of knots of the splines and, hence, are subject to the inconsistency 
we have discussed. 

The remainder of Section 2 shows how to implement the idea given in (2.2) 
for the ICA model. Technical analysis is carried out in Section 3. 

2.2. Further notation and assumptions. Let Wp be a nonsingular unmix- 
ing matrix such that S = WpX has m mutually independent components. 
Without loss of generality, assume that det(VFp) > 0. For any row vector 
w £ M™", let denote the PDF of wX and (j)^ denote the density score 
function defined by (pwit) = — §i^og fw{t)I {fw{t) > 0), where I(.) is an indi- 
cator function. 

In model (1.1), the order and scaling of rows of W or components of S 
must be constrained for W to be identifiable. For scaling, we take each Si 
to have absolute median 1, that is, -PdSj] < 1) = | or, equivalently. 



Even after this choice, the correct unmixing matrix requires 2"^m\ choices 
due to sign changes and row permutations. This ambiguity can be resolved 
in various ways, but we avoid being specific by assuming that a consistent 
starting value is available for Wp, say PCFICA of Chen and Bickel [8]. Let 
k{s) = 2/(|s| < 1) — 1. Then (2.3) is equivalent to 



(2.3) 





Equation (2.2), for our case, can be written 



n 



i=l 
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Table 1 
Algonthm EFFICA 



1. Input data {X^ , . . . ,X"}, an initial estimate W'"^ and B-spline basis 
functions 6^'=' = {b[''\ . . . for fc = 1, . . . , m; 

2. For j — 0,1, . . . until convergence: 

1) Set Si''> ={Wl:'^X' :{=!,..., n} for fc = 1, . . . ,m; 

2) Set for k — 1, . . . ,m, 

where 7^^' = (B^^' [Bf'''B('=)^])-^£;^^' [D^'^'], D'*) is the derivative function 
of B'-'"' and E'^^ is empirical expectation w.r.t. 5^"*'; 

3) Set at W = W^''': 

^w{s) = i4,['\ar),...,4,'£(ar„)f 

4) Set for k = 1, . . . ,m, 

4^'' vi'^=E[^'>[i;], where ^is)=2sl{\s\ < 1), 

(^i'¥=i5^'''M, where ^(s) = .^ 

5) Set M(-')(s), an m X m function matrix with elements: 

Mj^^]{s) = -$i^\sk)s,,,kj^k' 

= &['^Sk + Pi'\21(\sk\ <l)~l),k = k'; 

6) Set 

r(^''(x) = i;ec(M(^'(H/(^)x)[W'(^']-'^),* 
where vec{M) vectorizes M; 

7) Set 

8) Update W'-'+'-^ = W"'^' + [E$f']-ie$/'. 

*Here, i*^-'' (x) = 1* (x, W, ^w) at = W'-^'' , where can be identified as an 
estimate of the "nuisance parameter" ^w- 



where is the parameter defined by (2.9) and is an estimate of it. 
We give pseudo code in Table 1 for iteratively solving this equation. The 
expressions appearing in the pseudo code are developed in the rest of this 
section. 

2.3. Efficient score function ofW. The likelihood function of X under 
(1.1) can be expressed as 

m 

px i^,W,n.,m) = \ det iW)\l[r,{W,x). 

i=l 
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The parameter of interest is W, while ri-rn are nuisance parameters. For 
simplicity, assume E[Si] = 0. 

Let 4>i{si) = — log ri{si) I {ri{si) > 0) be the density score function as- 
sociated with Tj and define $ by <^(s) = {cl)_i{si), . . . ,4'rn{sm))'^ , where s = 
(si, . ..,SmV. Then the score function oiW, iiy(x) = avec(w) ^°g(P^ ■ 
is equal to 

iw{x) = Vec{{Imxm. - ^{S)S^)W-'^}, 

where s = Wx and Imxm is an m x m identity matrix. Thus, minimal regu- 
larity conditions for efficient estimation are that each should be absolutely 
continuous, W nonsingular and 

E{(j)i{Sif] < oo and E[Sf] < oo. 

Using the devices of tangent space and projection mentioned in Section 2.1 
(see calculation details in the Appendix), the efficient score can be expressed 
as 

(2.4) r(x, W, $) = vec{M{Wx)W-^), 
where M(s) is an m x m function matrix with elements 

(2.5) Mij{s) = -(pi{si)sj, forl<f/j<m, 

(2.6) Mii{s) = aiSi + (3iK{si), fori = l,...,m 
and 

(9 7) - (^-''i'^'^i a - - "^)^»^ ^2 _ prc^2i 

(2.8) Vi = E[2Sa{\Si\<l)], Ui = E[2S,<PiI{\S,\<l)]. 

Most of these formulas were derived in [2], but in a different context. 
We repeat these in our own notation for completeness. By the convolution 
theorem on semiparametric models (see [5]), the information bound for reg- 
ular estimators of W is {E[\*l*'^{X, W, ^)])~^. It is obvious that the efficient 
score function depends on T]^ ; ^2 only through the density score functions 
((/>!, . . . , (pm)- Next, we describe how to perform the estimation by using the 
efficient score function. 

2.4. The ICA estimate. Let 

(2.9) <^w = i(l)Wi,---,(pwj'^ 

and assume that a starting estimate W^^^ is available which is consistent for 
Wp. We shall show how to construct an estimate of for W close to 
Wp and then solve 

J l*{X,W,^w)dPn = 
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to obtain an efficient estimator of Wp. Here, is a data-dependent func- 
tion of W and, thus, 1*{X, W, ^w) is an approximation to the efficient score 
function given by (2.4). 

For each /c G {1, . . . , m}, choose a sieve for (j)w^ as follows. Let [hnki^nk] C 
R be a subset of supp^r^.) containing most of the mass of r^. For an integer 
Hk, set Uk + 4: equally spaced points {6^^ + — l)^n/c : 1 < ^ < "-fc + 4} as 
knots, where Snk depends on through 

^nk = (Kk - knk)/ (nk + 3), 

and then construct cubic B-spline basis functions, as in the Appendix. 
Here, is chosen by cross-validation as described in Section 2.5 below. De- 
note the basis functions as B^f^ = {B^-^ , . . . ,Bnnk)'^, where the superscript 
(k) denotes the association with Sk and the subscript n denotes the depen- 
dence on the sample size. Given the random sample {W^X^ : 1 < i < n} from 
the density function fyy^, the density score function (pw^ can be estimated 
by 

(2.10) 4>w, = bniWk)fB^^\ 

where 7n(Wfc) is given in Table 5 (see Section 2.5 for details). Then define 
$vi/(s) = {(pWiisi), . . . ,(pWm{^m))'^ ■ To avoid further complications, it is as- 
sumed that both [b:f^k,bnk] and are fixed using W^'^K That is, the nk + 4: 
knot locations are fixed. 

Now, replace the efficient score function 1*{X,W,^) defined in (2.4)- 
(2.8) by its profile form r{X,W,^w), where Oj, Pi and af defined in (2.7) 
and (2.8) are replaced with plug-in estimates 

(2.11) a. = Jlz^^ A= ^\;^^jf , af=[{Wa?dPn, 

f^i - vf af - vf J 

where Ui = jY=wx'^'Y^wAY)I{\y\ < ^)dPn and Vi = Jy^^,x2YI{\Y\ < 
l)dPn. 
Define 

{2.12) en{W)= j r{X,W,^w)dPn and e{W) = j \* {X,W,^w) dP 

and let be a solution of 

(2.13) e„(W) = 0, 

if it exists. Let i*(x,VF) = r(x, W,4'iy) and e„(VF) = Q^;i(w)en{W) . Note 
that 11 W ^ Wp, then -en{W) and / \*\*^{X, W) dPn have the same limit. 



de{W) 



dvec{W) 



= E[rr' {x,wp,<^p)], 

Wp 
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with probabihty converging to 1, as demonstrated later in Section 5. The fi- 
nal estimator W is defined as the limiting value of the following approximate 
Newton-Raphson iteration: 

'e„(W^(^)), j = 0,l,.... 

We shall show that this limit exists with probability tending to 1. Note 
that this method does not require the calculation of the Hessian matrix 
en{W). The convergence and asymptotic properties of (2.14) are developed 
in Section 3. CaU W = W^'^^ defined by (2.14) the EFFICA estimate. This 
is summarized in Table 1 and will be used for the simulation in Section 4. 
The density score estimation, as well as how to choose the number of knots 
by cross-validation (mentioned above), is provided in the next subsection. 

2.5. Estimating a density score function by B-spline approximations. Let 
<j) = —r'/r be the density score associated with a univariate PDF r. Let Q 
be a linear space with differentiable basis functions B = (i?i, . . . , Bj^)'^ such 
that each Bir vanishes at infinity. An estimator oi (f) \n Q can then be 
obtained by minimizing with respect to 7 € the mean square error 

c(7)= / {(t){s)-^^B{s)fr{s)ds. 
JR 

Using integration by parts, we obtain 

c(7) = 7^^r[BB^]7 - 27^E,[B'] + Er[(t>\ 

where B' is the derivative of B and E^ indicates expectation under r. Thus, 
the optimal 7 is 7^ = (i?r[BB-^])^^£'r[B'] and the best approximation of 
in ^, in the sense of mean square error, is (j)g = 7JB. This method was 
proposed by Jin [17] as a variant of Cox's [10] penalized estimators. Given 
a random sample of size n from the density function r, can be estimated 
by combinations of empirical moments. So, a natural estimator of (j) is given 
by 

(2.15) 4>g = 7JB, where 7J = (4[BB^])"i4[B'], 

where E^ is empirical expectation corresponding to E^.. 

B-spline basis functions are popular choices for Q. In general, the support 
of r is unknown and we need to choose a working interval [b^,bn], in which 
knots are distributed, for the construction of the basis functions. The basic 
rule for adaptation is that [6„,6n] — *■ supp{r) very slowly as 00. Here, 
6„ and bn are selected as and 1 — a„ empirical quantiles where — > 0. 
The number of basis functions, say N, is an additional empirical smoothing 
parameter. One can use cross-validation to choose N as follows: 



(2.14) = vt^(j) + jrr^{x,w^^^)dp, 
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1. Split the sample randomly into two halves, say Xi and I2] 

2. For = 1, 2, . . . , use Xi to estimate 7^ € by (2.15), say 7(^(Ti), and 
use X2 to evaluate c(70(Ti)) empirically, but omitting the last term Er[4P'], 
say cx2\i^{N). Similarly calculate cj^ixjl-^); 

3. Select as the largest value such that |{cx2|Ji(-^) + Cx^\i2{^)} strictly 
decreases until A^. 

Jin [17] used a similar method in the i.i.d. case we have discussed, proved 
its validity and showed that N = 0{n^) under weak smoothness assumptions, 
where (5 G (0, 1/6) depends on tail properties of r. 

3. Asymptotic properties. We are given (e.g., the PCFICA esti- 

mate) such that for some e.„ > 0, 

(3.1) P{\\W^'''^ -Wp\\F<en)^l 

as n — > 00, where e„ satisfies e„ — > and \/nen — > 00. Let 

(3.2) J7„ = {VFGM™^"':||VF-T^p||F<en}. 

Define 4>-w^^^n{x) = (j)]v^{x)I{x G [hnk^buk]) and consider the following condi- 
tions for 1 < k,i,j < m: 

CI: Wp is nonsingular. 

C2: E[Sk] = 0, E[Sl] < 00, med{\Sk\) = 1 and E{(l)k{Sk))^ < 00. 
C3: |rfc|oo < 00, |r^|oo < 00, supig7^|tr^(t)| < cx3. 

C4: The uniform law of large numbers (ULLN) holds for {^vy^ {WkX)Xi : W G 
^^n}, Hlr^ {WkX)Xf -.Wenn} and for {<P'^^ {WkX)WiXX, :WGQn}. 
C5: For some positive Ci,C2, r^it) > Ci5nk if i G [&nfc,^nA;], otherwise rk{t) < 

C6: snvwmJ^WkAoo^nk = 0{l) and s\XY>wmnWwk,n\ooKk = o{l). 
_ii _ 

C7: {hnk - Kk) = o{l). 

[Note: ULLN holds for Qn iff sup^^g^ | / g{X) d{Pn-P)\ = Op(l); see, e.g., [24].] 
Conditions C1-C3 are simplified regularity conditions. CI and the finite 
moments in C2 are among the minimal regularity conditions for considering 
efficiency, as mentioned in Section 2.3. Setting the absolute median to unity 
in C2 is a simple and minimal condition to make the scales of the unmixing 
matrix identifiable [9]. It should be clear that the zero mean assumption in 
C2 is in no way crucial to the general argument as the mean can be estimated 
adaptively, but it serves to keep algebraic complication to a minimum. C3 
assumes some smoothness on the density score function for each hidden 
component, which is needed if it is to be well approximated by B-splines. 

Conditions C4-C7 are technical conditions that we believe are far from 
necessary, but they are reasonably easy to check and enable construction of 
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a more compact proof. As an easy example, if |i;^fc|oo < oo and |?'^'/'''fc|oo < co 
for k = 1, . . . ,m, then by (A.l) in the Appendix, sup^g^^ |(?!>vi/fe|cxD < oo and 
by (A.2), 

^^Psi„ I '^VFi, I oo < OO. Thus C4 holds. C5 and C6 require that the 
tail of Tfc be not too wiggly. C6 also implies that 6nk 0. C7 requires that 
the initial value be reasonably close to the truth and that the domain and 
the number of knots of the B-splines [i.e., = (bnk — b^ik)^nk ~ ^1 
grow so quickly that we lose control of the approximation to ^w- 
Here is the main theorem: 

Theorem 3.1. In the ICA model (1.1), if (3.1) and C1-C7 hold for 
k = 1, . . . ,m, i ^ k and j / k, then with probability converging to 1, the 
algorithm (2.14) has a limit W^°°^ and 

(3.3) V^vec{W^°^^ -Wp) = I~f^V^ jr{X,Wp,^p)dPn + op{l), 

where leg = f\*\*'^{X,Wp,^p)dP. That is, is Fisher efficient. Fur- 

ther, 

^/^VeciW'^^'^Wp^ - Im-^ra) AA(0, I^fj^) , 

where leff = / vec{^sA)vec{^sJi)'^ dP does not depend on Wp and M is given 
by (2.5)-(2.8), with s replaced by S . 

The proof of Theorem 3.1 is given in later sections and the Appendix. 

4. Numerical studies and some computational issues. Two groups of ex- 
periments are implemented to test the empirical performance of EFFICA. 
Data are generated from known source distributions listed in Table 2 with 
a known mixing matrix VFp^. The boundaries for B-spline approximation 
of the density score functions are taken as = max(q„,(0), gn(O.Ol) — A„,) 
and bnk = min((7„(l), g„(0.99) + A„), where qn{-) denotes the empirical quan- 
tile and A„ = c • yToglogn. We used c = 5 in the simulation. The number 
of knots is key for EFFICA and is chosen by the cross-validation method 
described in Section 2.5. 

In the first group of experiments, two hidden components are used and 
Wp = [2,1; 2, 3]. The two components in the first twelve experiments are 
i.i.d from one of the distributions [1]-[12], and the two components in ex- 
periments 13-15 are from different distributions as specified in cases [13]-[15] 
of Table 2. Each experiment has been replicated 400 times with n = 1000. 

In the second group of experiments, the number of hidden components is 
increased to m = 4, 8 and 12, m hidden components are chosen with distri- 
butions of [0], [1], . . . , [m - 1] in Table 2 and Wp 

— Imxm- The experiments 
are replicated 100, 100 and 50 times for m = 4, 8 and 12, respectively, with 
n = 4000. 
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Comparisons are made with five existing ICA algorithms: the FastICA 
algorithm with the options of "symmetric" and "tanh" [15] which is equiva- 
lent to quasi-ML by using a tanh distribution for each hidden source (note: 
the FastICA code uses logistic for tanh), the JadeICA algorithm [7], the 
extended infomax algorithm [19], the KernellCA-Kgv algorithm [3] and the 
PCFICA algorithm. The PCFICA's estimate is used as the initial value for 
both EFFICA and KernellCA-Kgv. Due to the existence of multiple local 
solutions, PCFICA uses three starting values, one from FastICA and the 
others random. The performance of each algorithm is measured by both 
the Frobenius error, that is, dF{W ,Wp) = \\WWp^ — ImxraWp after suitable 
rescaling and permutation on rows of both W and Wp, and the so-called 
Amari error dA{W ,Wp) (e.g., [3]), 

Zm ~^ \ maxj \ aij \ J 2m \ maxj | Cij \ / 

where V, W are rescaled into V, W such that each row of V and W has 
norm 1 and aij = {VW~^)ij. The Amari error lies in [0,m — 1], is invariant 
under permutation and scaling of the rows of V and W and is equal to zero 
if and only if V and W represent the same row components. 

For each experiment in the first group of simulations with T = 400 repli- 
cations. Table 3 reports the average Amari error and square root of the mean 
square error ^/MSE with 

(i) 

where dp denotes the Frobenius error for the ith replication. For the second 
group of simulations. Figure 1 shows the boxplots of the Amari errors and 
Table 4 reports VMSE. 

From the simulation results, in some cases, some parametric ICA algo- 
rithms work very well and even outperform EFFICA. For example, FastICA 



Table 2 

Source distributions used in the simulations 



[0]. 


N(0,1) 


[8]. 


exp(l) + 17(0,1) 


m- 


exp(l) 


[9]. 


mixture exp. 


[2]. 


t(3) 


[10]. 


mixture of exp. and normal 


[3]. 


lognormal(l, 1) 


[11]. 


mixture Gaussians: multimodal 


[4]. 


t(5) 


[12]. 


mixture Gaussians: unimodal 


[5]. 


logistic(0, 1) 


[13]. 


exp(l) vs normal(0,l) 


[6]. 


Weibull(3, 1) 


[14]. 


lognormal(l, 1) vs normal ( 0, 1) 


[7]. 


exp(lO) + normal(0, 1) 


[15]. 


Weibull(3, 1) vs exp(l) 
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T able 3 

1000 X mean Amari errors (and 1000 x V MSE in braekets) for six ICA methods using 
m = 2 sources and sample size n — 1000. Source distributions for row k are given by [k] in 
Table 1. The bold numbers represent the best performance according to each experiment 



pdf 


Fast 


Jade 


Extlmax 


Pcf 


Kgv 


EFFICA 


1 


37 


39 


34 


18 


14 


7 




(89) 


(66) 


(57) 


(31) 


(24) 


(11) 


2 


36 


36 


24 


35 


33 


29 




(231) 


(68) 


(61) 


(61) 


(55) 


(52) 


3 


33 


31 


19 


16 


14 


5 




(243) 


(69) 


(33) 


(27) 


(24) 


(8) 


4 


39 


50 


41 


60 


61 


60 




(99) 


(86) 


(112) 


(100) 


(102) 


(110) 


5 


71 


85 


87 


109 


99 


128 




(194) 


(153) 


(232) 


(192) 


(170) 


(253) 


6 


42 


43 


32 


18 


15 


7 




(188) 


(83) 


(57) 


(30) 


(24) 


(11) 


( 






oo 


Xo 


iO 


g 




(205) 


(72) 


(96) 


(31) 


(25) 


(16) 


8 


36 


44 


35 


21 


19 


17 




(99) 


(96) 


(64) 


(35) 


(31) 


(28) 


9 


35 


37 


24 


16 


14 


4 




(212) 


(83) 


(41) 


(28) 


(24) 


(7) 


10 


46 


59 


39 


44 


30 


47 




(209) 


(103) 


(66) 


(74) 


(49) 


(105) 


11 


28 


33 


27 


29 


25 


25 




(47) 


(54) 


(45) 


(48) 


(41) 


(42) 


12 


50 


49 


44 


44 


39 


78 




(184) 


(82) 


(78) 


(74) 


(66) 


(264) 


13 


65 


52 


185 


24 


19 


16 




(164) 


(89) 


(355) 


(42) 


(35) 


(31) 


14 


35 


45 


91 


20 


14 


11 




(109) 


(89) 


(188) 


(35) 


(24) 


(20) 


15 


69 


72 


57 


32 


27 


11 




(192) 


(184) 


(136) 


(69) 


(58) 


(25) 



works best in case 5 where hidden sources have logistic distributions. This 
is not surprising as we have pointed out in Section 2.1 that a simple quasi- 
MLE can outperform an efficient estimator when the value of the nuisance 
parameter used by the quasi-MLE is close to the truth. But, in most exper- 
iments, the parametric methods (FastICA, JADE, Extlmax) behave worse 
than the nonparametric methods (PCFICA, Kgv, EFFICA) and EFFICA 
has both the smallest Amari errors and smallest Frobenius errors, while Kgv, 
which we conjecture can be efficient after appropriate regularization, is the 
best in the cases of mixture Gaussians. The three nonparametric ICA al- 
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Table 4 





10 X ^/MSE for 


ICA alg 


orithms with the same 


simulations 


as in Figure 


1 


Case 


Fast 


Jade 


Extlmcix 


PCF 


Kgv 


EFFICA 


m = 4 


0.82 


1.31 


2.71 


0.60 


0.51 


0.45 


m = 8 


7.0 


8.3 


11.2 


5.4 


4.3 


3.6 


m = 12 


9.1 


11.2 


13.1 


7.5 


8.0 


8.7 



gorithms require heavier computation, but their performance is better than 
the parametric methods. 

All of the ICA algorithms used in the simulation except EFFICA are based 
on contrast functions which empirically measure the dependence or nongaus- 
sianity among {VFiX, . . . , WmX} and, thus, they are invariant with respect 
to the choice of Wp for both error metrics dpiW ,Wp) and dA{W,Wp). 
We note that prewhitening, which is used for data preprocessing by these 
algorithms, can reduce such invariance, although it does not cause incon- 
sistency [8]. Theorem 3.1 implies that EFFICA is asymptotically invariant 
with respect to Wp. Figure 2 compares m = 8 with two different unmixing 
matrices Wp = Imxm and Wp = Imxm + V, where Vjk = j/m? + {k — l)/m 
for 1 < j,k < m. We performed many other simulations with different Wp 
and obtained similar results. We observe that the Frobenius error boxplots 
do change somewhat with different Wp, but EFFICA is more robust than 
other ICA algorithms. We believe that the main reasons are (i) none of the ICA 
algorithms are convex and, thus, may suffer from local solutions and (ii) EF- 
FICA does not use prewhitening for preprocessing, while others do. 

5. Proof of Theorem 3.1. In this section, we prove Theorem 3.1. Note 
that solving (2.13) can be viewed as a generalized M-estimator (GM-estimator) 
The existence/uniqueness, convergence and asymptotic linearity of GM- 
estimators have been studied in [5] (the Iteration Theorem in Appendix A. 10.2, 
page 517). The idea of our proof is to use the Iteration Theorem. 

Suppose that M„(0,P„) is a functional oi 6 £ (a subset of a finite Eu- 
clidean space) and Pn, but is not necessarily linear in P„. The subscript n in 
M„ allows the existence of a possible smoothing or sieve parameter depen- 
dent on n. The zero of Mn{9, P„) w.r.t 9 is called a generalized M-estimator. 
Let M(9,P) = Moo{9,P). We review the conditions for the Iteration Theo- 
rem. 

[GMl] 9pGn is the unique solution of M{9,P) = in 

[GM2] Mn{9p,Pr,) = / Vep(^) dPn + Op(n-V2) ^^^^ ^ j^^^py 

[GM3] M{9, P) is differentiable w.r.t in a neighborhood of 9p and '^^^gg^'^^ 
is nonsingular. 
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F J X P t< E 

m=12 



Fig. 1. Boxplots of Amari errors for ICA algorithms: in each case {left, middle, right), 
m hidden components are generated from pdfs [0], . . . , [m — 1] in Table 1. The X-labels rep- 
resent ICA algorithms: F-FastICA, J-JadeICA, X-extended infomax, P-PCFICA, K-Kgv, 
E-EFFICA. The sample sizes are 4000 for all the experiments and the replication times 
are 100, 100, 50 for m — 4, m = 8 and m = 12, respectively. 




W=l W=l+V 



Fig. 2. Frobemus errors (xlO) with m = 8 and ?i = 4000, where the left panel is for 
W = Imxm ind the right panel is for W = Imxm + V , with Vjk ~ j/m'^ + (fc — l)/m. Each 
experiment is replicated 100 times. 
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For our efficient score equation Mn{0,Pn) =e„(VF) defined in (2.12), [5] 
condition [U] becomes 

[U] s^xvwm^<{W)-e{Wp)\ = op{l). 

Theorem 5.1 [5]. Suppose [GMl], [GM2] and [GM3] hold with 
Mn{9, Pn) = en{W) and that [U] holds. If the starting point satisfies P{\W^'^^ - 
Wp\ < En) 1, then with probability converging to 1, en{W) in (2.13) has a 
unique root which is also the limit of the sequence defined by (2.14), 

except with J 1*1*'^ {X^W-^^) dPn replaced by —eniW'^^'^), andW^^^ is asymp- 
totically linear with the influence function — [e(Wp)]~^l*{.,Wp,^p). 

Theorem 5.1 is called the Iteration Theorem in [5]. Note that the sequence 
limit defined by the Iteration Theorem uses the exact Newton-Raphson, 
whereas we use an approximate Newton-Raphson, as in (2.14). To make 
up the difference, we need the following condition [V] which is verified by 
Proposition 6.4 in Section 6: 

[V] snpwenjf^*^i*''iX,W)dPn-Jl*l*T{X,Wp,<^Wp)dP\=op{l). 
Theorem 3.1 can now be proved as follows: 

Proof of Theorem 3.1. It is obvious that [GMl] holds under the con- 
ditions of Theorem 3.1 as it is the efficient score function. [GM2], [GM3] and 
[U] are verified by Propositions 6.1, 6.2 and 6.3 below, respectively. Thus, the 
conclusion of the above Iteration Theorem applies here. By Proposition 6.4, 
the condition [V] holds. Further, by Proposition 6.2, 

e{Wp) = -E[l*l*^{X,Wp,<i>p)]. 

Thus, we have 



sup 

Wen,, 



en{W) + J H*^{x,W) dPn 



op(l). 



Then following the contraction arguments of [5] (pages 317-319), the itera- 
tion given in (2.14) has the same limit as that replacing / 1*1*-^ (X, W^^^) dPn 
by —en{W^^')), with probability converging to 1. Thus, (3.3) holds. The sec- 
ond result follows from (3.3) directly by using (2.4) and the devices of Kro- 
necker product and vec operator. □ 



6. Propositions 6.1-6.4. This section verifies conditions [GM2], [GM3], 
[U] and [V]. For convenience, we list all the notation used in the following 
proofs in Table 5, for k € {1, . . . , m} and W € Note that all of the lemmas 
used in this section are given in the Appendix. For simplicity of notation, 
we will often write 5nk as 
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Table 5 

List of all notation used in the proof 



P,Pn 


population, empirical law of X 


W,Wk,Wij 


m X m matrix, kth row, {i,j)ih element 


Wp,Wpk,Wp,j 


unmixing matrix, kth row, (i,j)th entry 




PDF of Sk 


<t>k = -r'k/rk 


density score function for Sk 


— (rh^ fh 


"Ti 1 n pi" 1 OTi "iT^ppi" r\r 


fWk 


PDF of WkX {fwp^, = Tk) 




score function of WkX {(t>Wp^ = 4'k) 


4>k,n,4'Wk,n 


truncation oi (j)k,(t)Wk on \b„,^,b„k] 


= , ■ ■ ■ , (t>W„t)'^ 


function vector 


'^W ~ {4>W-i ) ■ ■ ■ , 4'Wr,-i)'^ 


function vector 


r{x,w,^) 


efficient score function defined by (2.4) 


e{W) = J r{X,W,^w)dP 


expectation 


en{W) = J l*{X,W,^w)dP, 


empirical expectation 


P>{fc) _ Cr{*) D(fc) \T 
I->n — l,-"nl 1 • • • 1 -"life / 


B-splines defined on [b„fc,fenfe] 


A„iWk)^jBi^^Bi^^^{WkX)dPr, 


served in coefficients of 


D^iWk) = / {Bi^^^YiWkX) dPr, 


served in coefficients of 0^/^ 


7n(Wfc)=^n(Wfe)-'£'n(Wfe) 


served as coefficients of (jiw,. 


A{Wk)= J Bi^^Bk''^'^{WkX)dP 


served in coefficients of (pyy^ 


D{Wk)= J{Bi^^y(WkX)dP 


served in coefficients of 4>Wk 


'yiWk)=A{Wk)-^D{Wk) 


served as coefficients of (j>yy^ 




closed linear span of B-splines 




estimator of in C/i''' defined by (2.10) 


iw, ^liWkfBl*'^ 


estimator of (pw^ in C/i'^' defined by (A. 3) 



Proposition 6.1. Under the conditions of Theorem 3.1, we have 
en{Wp) = J r{X,Wp,^p)dPn + opin-'^^). 

Proof. Recall the definition of e„(VF) given by (2.12) and that of 
r(x, W, ^) given by (2.4)-(2.8). It is sufficient to show that for 1 < i / j < m, 
0(i — ai = op(l) and Pi — Pi = op(l), where {ai,Pi) and {ai,Pi) are defined 
in (2.7) and (2.11), respectively, and 

(6.1) J 4>WpAS^)Sj dPn = J (pWpAS^)Sj dPn + Op{n~^/^), 

where Si = WpiX, Sj = WpjX. 

The first two are not hard verify with the law of large numbers and 
Lemma A. 7. Here, we will just show (6.1). Observe that 

^ ^Wpi{Si)Sj dPn - j (t)Wpi{Si)Sj dPn = J [(pWpASi) - 4)Wp.{Si)]Sj dPj 
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+ 



+ 



Wp,('S'i) - (t)i,n{Si)]SjdP^ 
i)i{Si) - 4>i,n{^i))Sj dPn 



= [l] + [2] + [3]. 

In the following, we will show that all of [1], [2] and [3] are op{n~^/'^). 
First, by Lemma A. 2(7) and Lemma A. 3, 

[1] = / {ln{Wpi) - 7(VrpO)''B« {S,)Sj dPn 

< WlniWpi) - j{Wpi)\\2 J B»(5,)5, dP, 

= op{l)Op{n-'/^). 
_Further, E{[2]f = - c|)Wp,^n{S^))^E{S]). By Lemma A.6, 

\^Wp, - 4>Wp„n\oc < c6l\(l)'{^p. Joe- Thus, by C6, 

[2]=n~'/X\<Pwp.,n\ooOp{l)=op{n-'/'). 
For [3], since P{Si ^ [bm, &m]) — > 0, we have 



n 



So [3] =op(n-V2). □ 



Proposition 6.2. Under conditions CI, C2 and C4 of Theorem 3.1, 
e{W) is differentiahle w.r.t. W in a neighborhood ofWp and 

e{Wp) = -E{l*l*^ {X,Wp,^p)} 

is nonsingular. 

Proof. Let r^(-) = ^{</'i«(")} fo'^ ™y nonzero row vector w € M™'. 
By (A.l) in the Appendix, after exchanging the order of differentiation and 
integration, we have E[Tw{wX)] = 0. Then by (2.5), we have for A; = 1, . . . , m, 

El^-^^{r{x,w,^w)}wp] = El^-^^{r{x,w,^p)}wp]. 

Since the left-hand side of the above is e{Wp), by Lemma A. 8, the right- 
hand side is equal to 

(6.2) e{Wp) = -E{l*l*^{X,Wp,<i>p)}. 

Note that the elements of M in (2.5)-(2.6) are linearly independent, and 
that this is also true for the elements of l*{.,Wp,^p). Thus, e{Wp) must 
be nonsingular. □ 
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Proposition 6.3. Under the conditions of Theorem 3.1, for i,j,k 
1, . . . , m, we have 



(6.3) sup 

17 JT, 



(|)wAW^X)XkdPn{X)- / (l)p,{Wp,X)XkdP 



Op(l) 



and for j , 
f ^ 

r J m 



{^wAWiX)}WjXdPn 



d 



dW, 



{(l}Pi{WpiX)}Wp,XdP 



(6.4) 

= op(l). 
Thus, condition [U] holds. 



Proof. We omit the superscript (i) in i?n ^ henceforth. By the Cauchy- 
Schwarz inequahty, 

j B^^\WiX)XkdPn '< j \Xk\^dPnJ f2[Bnl{WiX)f dPn. 



1=1 



Since J2iLi[Bni{WiX)]'^ < 1 by property III of B-splines in the Appendix, 
we have 



(6.5) 



sup 



I BniW^X)XkdPn 



Op(l) 



and by Lemma A.2(7), supQjjniWk) - 7(W^fc)||2 = op(l), so 



sup 



cl)wAWiX)XkdPniX)- / (j)w(WiX)XkdPn 



(6.6) 



: sup 



hniWk) - l{Wk)Y j 'Bn{W^X)XkdP, 

Bn{WiX)XkdPn 



<sup||7„(VFfc) -7(Wfc)||2Sup 



= op{l)Op{l). 

Further, by Lemma A.6, suY)^J(i)y^^{WiX) - 4>w„n\oo < sup^^ c|(/>',^^ „|oo5^, 



so 



sup 



cl)y^^{WiX)XkdPn{X)- I (t>w.^n{W,X)XkdPn 

(6.7) <sup|</.'t^^,„|oo5' 

= op{l) (byC6). 
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By C4, ULLN holds for {(j)wAWiX)Xk:W £ ^n}, and by Lemma A.l, 
supo P{WiX ^ [b^i,bni]) = o(l), so 



(6i 



sup 



Op(l). 



Recall that (^^^.^nix) = <^wAx)^{.x G [^mi^m])- From (6.6)-(6.8), we obtain 



(6.9) sup 

£7 



4>wX^iX)x^,dPn{x) 

Now, by C4, 



hwX^iX)XkdPr,{X) 



op(l). 



(6.10) 

and by continuity, 



sup 



1 </.H/,(VFiX)Xfc(i(P„-P) 



op(l). 



.11) sup 

17 7i 



0(1). 



(6.3) then follows from (6.9)-(6.11). 

In the following, we prove (6.4). Note that 



It suffices to show that the following hold: 



[4] sup 



[5] sup 



cP'^XWiX)XkW,XdPn{X)- I cP'p,{WpiX)XkWpjX dP 
d 



op(i); 



dWi 



ik 



-{^i {Wi)]Bn{W^X)WjXdPn{X) 



Op{l). 



Similarly to (6.3), the uniform convergence of [4] can be verified using 
conditions C4, C6, C7 and Lemmas A.l, A. 2 and A. 6. Further, the left-hand 
side of [5] is bounded by 

d 



sup 



ik 



-ln{Wi\ 



Bn{WiX)WjXdPn 



Op(l) 



where the first equality follows from Lemma A. 2 and Lemma A. 4 and the 
second follows from C7. Thus, [5] holds and, hence, (6.4) is proved. □ 

Proposition 6.4. Under the conditions of Theorem 3.1, condition [V] 
holds, that is, 



sup 



rr' {X, w) dPn - / rr^ (x, Wp,^wp) dP 



Op{l) 
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Proof. By checking the elements of 1*1*"^ (x, W), it suffices to show that 
for 1 < k,k' <m and i ^ j, 



sup 



sup 



^W. {WiX)^W, {WjX)XkXk' dPn 

Pi{WpiX)cl)pj{WpjX)XkXk' dP 



op(l), 



<j)wAWiX)XkXk'dPniX)- / cl)piiWpiX)XkXk' dP 



op(l) 



and 



sup 



4>wAWiX)XkK{WjX)dPn{x)- / ^Pi{WpiX)XkK{Wp,x)dP 



Op{l). 



Each of these can be verified using Lemmas A.l, A.2, A. 6 and conditions 
C4, C6 and C7 with arguments similar to those used in proving (6.3). □ 



7. Conclusion. In this paper, we viewed the classical ICA model within 
the framework of semiparametric models and obtained an asymptotically ef- 
ficient estimator for the unmixing matrix by solving an approximate efficient 
score equation. The main difference between this new method and popular 
parametric ICA methods is that we estimate the density score functions 
of hidden sources adaptively. A variety of simulations have illustrated sta- 
tistical efficiency of this estimator in comparison with state-of-the-art ICA 
algorithms. 



APPENDIX 



A.l. Some useful formulas. Let v = wWp^ . Then wX = vS. Ifv^ ^0 for 
some k G {1, . . . ,m}, then by the classical convolution formula, we have 



fUt) 



-rk 



^k^j^j 



Vk 



Y[rj{sj)dsj = E 



Vk 



-n 



^ ~ J2j^kVjSj 

Vk 



Since (t) is a marginal density function of {vS, Sj :! < j ^ k < m), by a 
standard formula (see, e.g., [4]), 



(A.l) 



1 

Vk 



-E 



rk 



t - Y^j^kVjSj 



Vk 



vS = t 



1 

Vk 



E[cPk{Sk)\vS = t] 



and further calculation gives 
d 



(A.2) 



dt 



1 ^fr'^f t-E.^kVjSj 



Tk 



Vk 



vS = t 
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A. 2. Calculation of the efficient score. To formulate the tangent space 
defined in Section 2.1 for each nuisance parameter r^, by taking the smooth 
submodel {rj(-;t) = rj(-)e*^'(') : |t| < 1} for some hi G £'^{P), we have 

d 

lim ^{logpx(x, W,ri, . . .,ri{-;t), . . . ,r„,)} = /li(VFix). 

Since needs to be a probability density function which satisfies the 

mean and absolute median assumptions, hi needs to satisfy E[hi{Si)] = 0, 
E[hi{Si)Si] = and E[hi{Si)K{Si)] = 0, but is otherwise arbitrary. Thus, the 
tangent space for can be expressed as 

TSi = {hi{Wi^)€C^{P)\E[hi{S^)]=0,E[h,{S,)S,]=0,EMS^)K{Si)] = 0}. 

Note that the tangent spaces { TSi -1 <i < m} are perpendicular to each 
other since the Si are mutually independent. Thus, any projection onto 
the tangent space of (ri,...,r„^) is equal to the summation of the partial 
projection onto each TSi. The efficient score of W becomes 

m 

r(.,W^,$) = W-^7r(W|T5,), 
1=1 

where vr(.|L) denotes the projection operator in ,,^)) onto L. 

Since each off-diagonal entry of Imxm — 

^{S)S'^ is perpendicular to all TSi 
and each diagonal entry of it is perpendicular to all but one TSi, 1* can 
be obtained as in (2.4)-(2.8) of Section 2.3 by using the fact that TSj~ = 
span{l,Si,K{Si)}. 

A.3. Some properties of cubic B-splines. Let < ^2 < • • < '^TV be fixed 
points. The first order B-spline basis functions based on these knots can be 
expressed as B}{x) = I{x G i = l,...,A^ — 1, and the A;th order 

B-spline basis functions can be obtained recursively {k > 2) by 



^ TDk-lf\ , ^i+k X „fc_i , 

7 T^i [x) + 7 7 ^i+i ( 

Zi+k-l ~ Qi+k ~ Zi+l 



for i = 1, . . . , N — k. Each B^{x) is differentiable w.r.t. x up to order k — 2 
and its first order derivative can be expressed as 



fc-i. 



^B^{x) = ^ ^ B^-\x) ^—^B' 



X 



We use the 4th order, so-called cubic, B-splines {Bf -.1 < i < N — A} 
with equally spaced knots, that is, ^j+i — = 5 {i = 1, . . . ,N — 1) for some 
algorithm-determined 5. For simplicity, the superscript in Bf is omitted be- 
low. The following properties of cubic B-splines will be frequently used in 
proving the lemmas below (see [11, 23] for details): 
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(I) 0< B,(x) < 1, Bi{x)Bj{x) = if > 3. 

(II) \^B.(x)\<6-\\£b,{x)\<26-\ 
(HI) Elimx)? < 1, J:1,[£b,{x)]^ < 66-\ 



A. 4. Supporting lemmas for Propositions 6.1 6.4. In this subsection, we 
prove all of the lemmas used in the proofs of Propositions 6.1-6.4. Recall 
that for each (pi^ {k = l,. . . ,m), we have an interval [knkT^nk] and cubic 
B-spline basis functions defined thereupon using equally spaced knots on it, 
say bI^^ = {B^Ij , . . . , B^Jf,)'^ , as in Section 2.4. Thus, we have constructed 
a sequence of sieves Gn^ using as basis functions. For any W &Qn, we 

(k) 

have a class of estimates 4>Wk £ Gn for 4>Wki as defined in Section 2.4. For 

(k) 

convenience, the superscript of B„ is often omitted below. 

Let fii*^^ = {Wk : W G for = 1, . . . , m. We also need an intermediate 

~ (k) (k) 

approximation function ^^y^ GGn , defined as follows. For w G Qn , 
(A.3) ^^=7H^bW, 



where -f{w) = A{w)-'^D{w) with A{w) = J B^^'^B^n'^^ (wX) dP and D{w) = 
J[B^^]' (wX) dP . Note that the subscript w of cf)^ should always be associ- 
ated with r^rf ^ for some /l G {1, . . . , similarly for (pi^)- 

In the following, c denotes a constant (only dependent on the population 
law P) , but its exact value may vary in different places (even in a single line) 
without clarification. For a column vector x G M™, ||x||2 = Vx^x. For an m x 
m real matrix A, \\A\\i = maxi<j<m ||j4j||2, \\A\\2 = max^^jim ^^^^^^^^i \\Ax\\2 



and \\A\\f = ^Jtr{A^A), thus, \\A\\2 < \\A\\i. 

The following Lemmas A.1-A.8 hold under the conditions of Theorem 3.1. 
Jin [17] obtained results similar to Lemma A. 2 and Lemmas A.5-A.7 con- 
cerning the B-spline approximation, but in generally different settings. 



Lemma A.l. For sufficient large n, sup^g^(fe) |/^„|oo < oo, sup^^^(fe) |/4|oo 

0(1). 



Proof. The first two inequalities follow easily from C3. The remaining 
two are proved as follows. For any w G , \\w — Wpk\\2 l^^n- If we let v = 
wWp^ , then \vj\ ^ for 1 < j ^ k < m and ^ 1 as n — > oo. Since fwit) = 

t V - S ' 

-^[^^fe( we can consider the right-hand side as a function (say 
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h) of By the first order Taylor expansion, 



\fUt)-rk{t)\<en\\Wp^h{ E sup 



. j=iw£n 



_d_ 

dvj 



Kv) 



< CSr, 



where by direct calculation and using C3, \-^h{v)\ is uniformly bounded 

with w G r^n^^ and t G R. Recall that rk{t) > c6n for t G [bnk^bnk] and e„ <^ 6n, 
thus 

sup mill fw{t)>c6n- 

Finally, 

p{wXe[b^M) 

Since Snipnk-knk) 



_ fw{t)dt> / _ (rfc(t)-ce„)dt 

= P{Sk G [^„fc,5„fc]) — CSnipnk — knk)- 

: 0(1) andP(S'fc G [bnk,bnk]) ^ 1, we have inf^^^{fe) P{wX £ 



Recall the definitions of ^ly^., 'yn{Wk), An{Wk) and Dn{Wk) in Section 2.4 
and that of 7(w) = in (A. 3). The lemmas below give their 

uniform convergence rates. 

Lemma A. 2. The following holds for k,i = 1, . . . , m: 
(1) \\D{w)\\2 < c6nnj\- c5l < eig{Aiw)) < c<5„ for w G ^^n^ ■ 



(2) 






-2?HI|2 = Op(K loj 


?nfc)V2(^5„)-i/2) 


(3) 






|2 = Op(*„nP); 




(4) 






-AH||2 = Op((5„log 




(5) 






)ll2 = Op(5„-^); 




(6) 


s^^P«;eci^-) 


\ln{w)\ 


2 = Op(nP5-i); 





(7) sup (fe)||7„(u;) -7(u;)||2 = op(l); 

(8) SUp^(.)||^{yl„H}l|2 = 0p(<^n'/') 

(9) supJ.)||^{I)„M}||2 = Op(<5-2); 



(10) sup^w||^{7„(u;)}||2 = op(<5„^''^4/^). 

Proof. The procedure is as follows. First, (3) is implied by (1) and (2). 
Second, (6) is implied by (3) and (5). Third, since 

d d d 

^{7nH} = -^;:'M^{^nH}7nH + ^;;'H^{i?nH}, 
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(10) is implied by (5), (6), (8) and (9). Hence, it suffices to prove (1), (2), 
(4), (5), (7), (8) and (9). Further, the proofs of (2) and (4) are similar and 
the proofs of (8) and (9) are similar. Thus the proofs of (4) and (9) are 
omitted. 



Proof of (1) . By taking the derivatives of the cubic B-splines, B'^^{t) = 
5~^{B^^{t) — where B^^ are the third order B-splines defined on 

the same knots, z = 1, . . . , rifc, we have 



\Di(w) 



{Bl,{t)-Bl,_,,{t))U{t)dt 



BlAt){U{t)- Ut + 5n))dt 



< J Bli{t)dt\fl,\^<36n\fl, 



So the first result holds due to Lemma A.l. The second result follows from 
Lemma 5.1 of [17]. □ 



Proof of (2). Note that 

P( sup \\Dn{w)-D{w)\\2>t 

= P I sup 
<gp( sup 

For a fixed pair let J^n = = B'^ ,i{w'x) -.w G ^^n^}, a class of 

functions indexed by Then for w, v G 0,n \ \\gw{^) — 9v{^)\\2 < 2(5~^||7i; — 
f^||2||x||2 by property II of cubic B-splines. Now, the index set Qn can be 
covered by A'^ = ce^u~"^ balls of radius u and for any w,v in the same ball, 
— g(„(X)||2] < c6~^u. Further, by property I of cubic B-splines, 
^'^Pgn.eJ^Jdwloo < S~k - Then for cmax(5-^ e„(5-2) < a < c^, 

P I sup \/n 

This follows from Theorem 5.11 of van de Geer ([24], page 75) which gener- 
alizes Hoeffding's inequality and calculates the uniform convergence rate for 



J B[,{wX)d{Pn-P) 



>t 



J B'^4wX)d{Pn-P) 



> 



J B'^4wX)d{Pn-P) 



> a < exp(— ca (5„). 



26 



A. CHEN AND P. J. BICKEL 



a class of functions in terms of its size measure, or so-called bracketing en- 
tropy; see [24] for details. Note that e„ <C (5„ by C7, so for t > cn^J'^n~^^'^6~^ , 
we have 



P[ sup \\Dn{w) — D{w)\\2>tj <7ikeyip{—ct'^n6nnf,^). 

Thus, 



snp\\D.,{w)-D{w)h = oJJ^^^^^). □ 



,1 t^T), 



n6n 



Proof of (5). Since ^"^ = {A + An-A)-^ = A-^{I + {An-A)A-^)-\ 
by (1) and (4) (omitting the index w), we have 

sup P„ - ^||2P""^||2 = Op(l), 



hence, 



sup \\A-^\\2< sup \\A-%{l-\\An-A\\2\\A-'\\2)-^ = Op{5-''). 

Here, we use the inequality of matrix norm ||(/ + yl)~"^||2 < (1 — ||A||2)~^ for 
any square matrix A with ||^||2 < 1, where / is the identity matrix. □ 

Proof of (7). Since by (l)-(5), 

sup ||7„(?i;) -7(?i;)||2 = sup \\A-\Dn - D) - A'^Arr - A)A-^ D^h 



n / /rife log nfc 2 SnlogUk 



= 0P(1), 

where the last equality follows from C7. □ 



B«), 



Proof of (8). Note that the partial derivative is (omitting (i) from 
d 



{An{w)}jf = J {BnjBl,j, + Bl,jBnf){wX)XkdPn. 



dwk 

By the Cauchy-Schwarz inequality 



r 2 

J {BnjB'^^, + B'^jBr,f){wX)Xk dPn 

< J {Bn,Bl,j, + B'„jBnj'f{wX)dPn J XldPn. 
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Similarly, we can calculate the uniform convergence rate, 

sup sup / V + B[^.Br,jif{wX) d{Pn - P) = Op(l). 

0<ij'<n„|i-j'|<3,^gt^W^ 

Further, from Lemma A.l, sup ^(k)\fw\ is bounded so after algebraic ex- 
pansion, we have 

sup / {BnjB'^j, + B'^^Bnj'f{wX) dP < c6;;\ 



Thus, \-^jj:{An{w)}jj'\ < cSn^^"^. For cubic B-splines, Bnj{x)B'^j,{x) = for 

|j — j'\ > 3, thus, each row of has at most seven nonzero ele- 

ments. So, 



sup 



d 



dwi 



< sup 

2 n*') 



d 



dwi 



Op{6n 



□ 



Lemma A. 3. \\ jB'^\Si)Sj dP^h = Op{n-^/^), where Si = WpiX, 1< 
i ^ j < m. 

Proof. (We omit {i) in B^^^sJ*^ ^ 



nk ' 



E 



B„(S'j)5j dPn 



-E[Y.Bnk{S.fS]]<-E{S] 



□ 



Lemma A.4. snp^ J J B'ii\w,X)WjX dPnh = Op(e„(5^^n. ''^) for 1 < 

Proof. Similarly to the proof of Lemma A. 2(2), we have 
sup / B^{W,X)WjX d{Pn - P) 

n„ J 

Further, note that \Bnk{x) — Bnk{y)\ < ^n^\x — y\, so for any W G 

/ Bn{WiX)WjXdP 
J 2 

n 2 

= ^ / iBnk{WiX)W,X - B^kiWpiX)WpjX) dP 

k=i 

rii 

< J2 ('^n'^ll^llill^^^ - Wp.himh + EWxym - WpihY 

k=l 



Op\ 



I Slrii log 
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Thus, ^nvnAS'B^:\w,X)W,XdPnh = 0p{^[^^ + en5-^n]^^). □ 

Lemma A.5. E{C^w^{WiX) - ct>w.,n{W,X)f] < dlW^^^J^. 

Proof. Let d{4)Wi,niQn) = inf/ieg„ \4>Wi,n - h\oo- Then by the Jackson- 
type theorem [11], 

Wi,n\co- 

Thus, the result fohows from 

E{(^W.{W,X) - ct>w.,n{WiX)f} = mf E{{h{WiX) - <l}w^^n{WiX)f} 

<d{(l)W^,n,Qnf- □ 

Lemma A. 6. \(j}Wi-(t>Wi,n\oo < c^nl'/'w^nU; l-^iy, -<^Wi,nloo < c\(p'{^^^JocSn- 

Proof. By Theorem Xn.4 of de Boor (1978), there exists a quasi- 
interpolant with some a £ , 

4>w,{t)=JBn(,t), 

such that (j)^r. simultaneously approximates 4'Wi,n and its first derivative to 
optimal order, that is, 

~ -/ 

So, 

Combining this with Lemma A.5, we have 

E{^Wi - (i>W,f < E{ci)Wi - 4>Wi,nf + E{ci)w^ - (t)Wunf < c|(/''l^,,nlL'5n- 

Let coef{(j)^r_) and coef{(j)^/.) be coefficients of B„ in (j)^/. and ^yj/. , respec- 
tively. Then 

> Xn\\coef{^wJ - coef{(j)yi,J\\l, 

where is the minimum eig envalue oi A{Wi) = E[B„B^(WiX)]. By Lemma A.2, 
An > c6l. Thus, 

\(i>Wi -4'w,\oo < ||coe/((^,yJ - coef {(()y^^J\\2 < clfpw.^JooSl. 
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Hence, 

Further, by observing that |-B^^|oo < , we have 

\4>w, - (i>w,\oo < \\coef{(i)wJ - coef {4>wJhS~^ < c\(l)w^^JooSn- 

Thus, 

l0Wi ~ 4>Wi,n\oo < \4>Wi ~ 4'Wi,n\oo + \4'Wi ~ 4>Wi\oo < c\4>Wi,n\^^n- ^ 

Lemma A. 7. ^{^WpkiSk) - (t>k{Sk)f dPn = Op{l). 
Proof. Observe that 

^ y" {(l>Wpk{Sk) - (l>k{Sk)f dPn < J {4>Wpu{Sk) - (i>WpkiSk)f dPn 

+ j {(i^Wpki^k) - 4>k,n{Sk)f dPn 

+ J (t>k{Skfi{Ski[KAk\)dPn. 

The remainder of the proof is similar to the use of (6.1) in proving Propo- 
sition 6.1 by using Lemma A. 2 and Lemma A. 6. □ 

Lemma A. 8. Let {p{-,9, 'q):d £Q C M*^, rj £ £} be a parametric or semi- 
parametric model, where 9 is the parameter of interest. Suppose that moder- 
ate regularity conditions are satisfied and that l*{-,9,rj) is the efficient score 
function of 9, as defined in [5]. Then 

J ^r{X,9,rj)dP^e,v) = -J l*r^iX, 9, ri)dP(^e,v)- 

Proof. We only prove this for the parametric case £ C M"^. Let I{9,r]) 
be the information matrix of {9,r]). Then by classic likelihood theory (e.g.. 
Proposition 2.4.1 of [5]), r(-,6',r/) = li - (Ii2l^2^)(6', r?)i2. Here, ii and h are 
the partial derivatives of l{-,9,r]) = logp{-,9,r]) w.r.t. 9 and rj, respectively. 
Similarly, Ijj (i,j = 1,2) are defined as second order derivatives of l{-,9,r]) 
w.r.t. {9, 77). Thus, ^1* {X, 9, 7?) = In - (Ii2l2"2') (^, v)hi - ^{(Ii2l2"2') (^, v)}h ■ 
Since f\2{X,9,'q)dP(^0^^) = 0, we have 

j ^r{X,9,ri)dP^e,r,)= JliidP^g^,)-{li2l2i){0,v) JhldP(^e,r,)- 
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Since the information matrix satisfies lij = — J lij{X,9,ri) dP(^0 ,^'^, i,j = 1,2, 

the result fohows by / 1*1*'^ dP = 111 — Ii2l22^l2i (see Proposition 2.4.1 of [5], 
page 32). For the semiparametric case, the reader is referred to [5] for a 
generalization of this proof. □ 
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